AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Import the metrics class from sklearn
from sklearn import metrics
# Import the variance_inflation_factor class from statsmodels.stats.outliers_influence
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Import the train_test_split class from sklearn.model_selection
from sklearn.model_selection import train_test_split
# Import the DecisionTreeClassifier class from sklearn.tree
from sklearn.tree import DecisionTreeClassifier
# Import the tree module from sklearn
from sklearn import tree
# Import the GridSearchCV class from sklearn.model_selection
from sklearn.model_selection import GridSearchCV
# Import the f1_score, accuracy_score, recall_score, precision_score, confusion_matrix, ConfusionMatrixDisplay, and make_scorer functions from sklearn.metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
ConfusionMatrixDisplay,
make_scorer,
roc_auc_score,
roc_curve,
precision_recall_curve,
)
# Load the dataset
df = pd.read_csv('Loan_Modelling.csv')
df.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
df.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
print(f"Number of rows: {df.shape[0] }, Number of columns {df.shape[1]}")
Number of rows: 5000, Number of columns 14
# Data frame information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# Count the duplicated data
df[df.duplicated()].count()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
# Count null values
df.isnull().sum()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
# The describe() method returns a DataFrame that contains descriptive statistics for each column of the DataFrame.
df.describe(include='all').T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
# Generating a box plot
plt.figure(figsize=(15, 7))
sns.boxplot(data=df, x='Mortgage')
Observation
Mortgage has a right-skewed distribution with a high percentage of outliers.
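A quick numeric check backs this up (illustrative; pandas' skew() returns the sample skewness, where a large positive value means a long right tail):
# Positive skewness confirms the long right tail seen in the box plot
print(df['Mortgage'].skew())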
# Customers who spend on credit cards (CCAvg > 0) or hold a credit card issued by another bank
use_a_credit_card = df[df['CCAvg'] != 0]
has_a_credit_card = df[df['CreditCard'] == 1]
combined_list = set(use_a_credit_card.index).union(has_a_credit_card.index)
print("Customers with a credit card:", len(combined_list))
Customers with a credit card: 4922
# Plot a heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(data=df.corr(), annot=True, cmap='YlGnBu', vmin=-0.2, vmax=1)
In this case, the correlation matrix shows that Age and Experience are the most closely related pair of variables, being almost perfectly correlated with each other; this multicollinearity is dealt with later using the VIF.
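As a cross-check, the strongest pairwise correlations can be listed directly from the correlation matrix. The helper below is a small illustrative sketch (the 0.5 cut-off is arbitrary and not used elsewhere in this notebook):
# Keep the upper triangle of the absolute correlation matrix and list the strongest pairs
corr_abs = df.corr().abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
corr_pairs = upper.stack().sort_values(ascending=False)
print(corr_pairs[corr_pairs > 0.5])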
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    tab1 = pd.crosstab(data[predictor], data[target], margins=True)
    print(tab1)
    tab = pd.crosstab(data[predictor], data[target], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
stacked_barplot(df, 'Age', 'Personal_Loan')
There is a relationship between age, earning power, and loan uptake: as customers advance in their careers, their income tends to rise, which can make them more able to afford a personal loan and therefore more likely to take one.
stacked_barplot(df, 'Education', 'Personal_Loan')
# Unique values of all the columns to check values
for column in df.columns:
print('-'*20)
print(column)
print(df[column].unique())
-------------------- ID [ 1 2 3 ... 4998 4999 5000] -------------------- Age [25 45 39 35 37 53 50 34 65 29 48 59 67 60 38 42 46 55 56 57 44 36 43 40 30 31 51 32 61 41 28 49 47 62 58 54 33 27 66 24 52 26 64 63 23] -------------------- Experience [ 1 19 15 9 8 13 27 24 10 39 5 23 32 41 30 14 18 21 28 31 11 16 20 35 6 25 7 12 26 37 17 2 36 29 3 22 -1 34 0 38 40 33 4 -2 42 -3 43] -------------------- Income [ 49 34 11 100 45 29 72 22 81 180 105 114 40 112 130 193 21 25 63 62 43 152 83 158 48 119 35 41 18 50 121 71 141 80 84 60 132 104 52 194 8 131 190 44 139 93 188 39 125 32 20 115 69 85 135 12 133 19 82 109 42 78 51 113 118 64 161 94 15 74 30 38 9 92 61 73 70 149 98 128 31 58 54 124 163 24 79 134 23 13 138 171 168 65 10 148 159 169 144 165 59 68 91 172 55 155 53 89 28 75 170 120 99 111 33 129 122 150 195 110 101 191 140 153 173 174 90 179 145 200 183 182 88 160 205 164 14 175 103 108 185 204 154 102 192 202 162 142 95 184 181 143 123 178 198 201 203 189 151 199 224 218] -------------------- ZIPCode [91107 90089 94720 94112 91330 92121 91711 93943 93023 94710 90277 93106 94920 91741 95054 95010 94305 91604 94015 90095 91320 95521 95064 90064 94539 94104 94117 94801 94035 92647 95814 94114 94115 92672 94122 90019 95616 94065 95014 91380 95747 92373 92093 94005 90245 95819 94022 90404 93407 94523 90024 91360 95670 95123 90045 91335 93907 92007 94606 94611 94901 92220 93305 95134 94612 92507 91730 94501 94303 94105 94550 92612 95617 92374 94080 94608 93555 93311 94704 92717 92037 95136 94542 94143 91775 92703 92354 92024 92831 92833 94304 90057 92130 91301 92096 92646 92182 92131 93720 90840 95035 93010 94928 95831 91770 90007 94102 91423 93955 94107 92834 93117 94551 94596 94025 94545 95053 90036 91125 95120 94706 95827 90503 90250 95817 95503 93111 94132 95818 91942 90401 93524 95133 92173 94043 92521 92122 93118 92697 94577 91345 94123 92152 91355 94609 94306 96150 94110 94707 91326 90291 92807 95051 94085 92677 92614 92626 94583 92103 92691 92407 90504 94002 95039 94063 94923 95023 90058 92126 94118 90029 92806 94806 92110 94536 90623 92069 92843 92120 95605 90740 91207 95929 93437 90630 90034 90266 95630 93657 92038 91304 92606 92192 90745 95060 94301 92692 92101 94610 90254 94590 92028 92054 92029 93105 91941 92346 94402 94618 94904 93077 95482 91709 91311 94509 92866 91745 94111 94309 90073 92333 90505 94998 94086 94709 95825 90509 93108 94588 91706 92109 92068 95841 92123 91342 90232 92634 91006 91768 90028 92008 95112 92154 92115 92177 90640 94607 92780 90009 92518 91007 93014 94024 90027 95207 90717 94534 94010 91614 94234 90210 95020 92870 92124 90049 94521 95678 95045 92653 92821 90025 92835 91910 94701 91129 90071 96651 94960 91902 90033 95621 90037 90005 93940 91109 93009 93561 95126 94109 93107 94591 92251 92648 92709 91754 92009 96064 91103 91030 90066 95403 91016 95348 91950 95822 94538 92056 93063 91040 92661 94061 95758 96091 94066 94939 95138 95762 92064 94708 92106 92116 91302 90048 90405 92325 91116 92868 90638 90747 93611 95833 91605 92675 90650 95820 90018 93711 95973 92886 95812 91203 91105 95008 90016 90035 92129 90720 94949 90041 95003 95192 91101 94126 90230 93101 91365 91367 91763 92660 92104 91361 90011 90032 95354 94546 92673 95741 95351 92399 90274 94087 90044 94131 94124 95032 90212 93109 94019 95828 90086 94555 93033 93022 91343 91911 94803 94553 95211 90304 92084 90601 92704 92350 94705 93401 90502 94571 95070 92735 95037 95135 94028 96003 91024 90065 95405 95370 93727 92867 95821 94566 95125 94526 94604 96008 93065 96001 95006 90639 92630 95307 91801 94302 
91710 93950 90059 94108 94558 93933 92161 94507 94575 95449 93403 93460 95005 93302 94040 91401 95816 92624 95131 94965 91784 91765 90280 95422 95518 95193 92694 90275 90272 91791 92705 91773 93003 90755 96145 94703 96094 95842 94116 90068 94970 90813 94404 94598] -------------------- Family [4 3 1 2] -------------------- CCAvg [ 1.6 1.5 1. 2.7 0.4 0.3 0.6 8.9 2.4 0.1 3.8 2.5 2. 4.7 8.1 0.5 0.9 1.2 0.7 3.9 0.2 2.2 3.3 1.8 2.9 1.4 5. 2.3 1.1 5.7 4.5 2.1 8. 1.7 0. 2.8 3.5 4. 2.6 1.3 5.6 5.2 3. 4.6 3.6 7.2 1.75 7.4 2.67 7.5 6.5 7.8 7.9 4.1 1.9 4.3 6.8 5.1 3.1 0.8 3.7 6.2 0.75 2.33 4.9 0.67 3.2 5.5 6.9 4.33 7.3 4.2 4.4 6.1 6.33 6.6 5.3 3.4 7. 6.3 8.3 6. 1.67 8.6 7.6 6.4 10. 5.9 5.4 8.8 1.33 9. 6.7 4.25 6.67 5.8 4.8 3.25 5.67 8.5 4.75 4.67 3.67 8.2 3.33 5.33 9.3 2.75] -------------------- Education [1 2 3] -------------------- Mortgage [ 0 155 104 134 111 260 163 159 97 122 193 198 285 412 153 211 207 240 455 112 336 132 118 174 126 236 166 136 309 103 366 101 251 276 161 149 188 116 135 244 164 81 315 140 95 89 90 105 100 282 209 249 91 98 145 150 169 280 99 78 264 113 117 325 121 138 77 158 109 131 391 88 129 196 617 123 167 190 248 82 402 360 392 185 419 270 148 466 175 147 220 133 182 290 125 124 224 141 119 139 115 458 172 156 547 470 304 221 108 179 271 378 176 76 314 87 203 180 230 137 152 485 300 272 144 94 208 275 83 218 327 322 205 227 239 85 160 364 449 75 107 92 187 355 106 587 214 307 263 310 127 252 170 265 177 305 372 79 301 232 289 212 250 84 130 303 256 259 204 524 157 231 287 247 333 229 357 361 294 86 329 142 184 442 233 215 394 475 197 228 297 128 241 437 178 428 162 234 257 219 337 382 397 181 120 380 200 433 222 483 154 171 146 110 201 277 268 237 102 93 354 195 194 238 226 318 342 266 114 245 341 421 359 565 319 151 267 601 567 352 284 199 80 334 389 186 246 589 242 143 323 535 293 398 343 255 311 446 223 262 422 192 217 168 299 505 400 165 183 326 298 569 374 216 191 408 406 452 432 312 477 396 582 358 213 467 331 295 235 635 385 328 522 496 415 461 344 206 368 321 296 373 292 383 427 189 202 96 429 431 286 508 210 416 553 403 225 500 313 410 273 381 330 345 253 258 351 353 308 278 464 509 243 173 481 281 306 577 302 405 571 581 550 283 612 590 541] -------------------- Personal_Loan [0 1] -------------------- Securities_Account [1 0] -------------------- CD_Account [0 1] -------------------- Online [0 1] -------------------- CreditCard [0 1]
The ID column is not relevant; it carries the same information as the index.
df = df.drop('ID', axis=1)
We can observe some invalid (negative) values in the Experience column.
df[df['Experience'] < 0]['Experience'].unique()
array([-1, -2, -3])
# Correcting the wrong (negative) values by mapping them to their positive counterparts
df['Experience'] = df['Experience'].replace({-1: 1, -2: 2, -3: 3})
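A quick sanity check (illustrative) confirms that the correction removed all negative values:
# After the replacement there should be no negative Experience values left
print((df['Experience'] < 0).sum())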
Data distribution
# Print the data distribution of each column of the dataframe
for column in df.columns:
plt.figure(figsize=(15, 7))
sns.histplot(data=df, x=column, kde=True)
Outliers
# Print the outliers distribution
for column in ['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Mortgage']:
plt.figure(figsize=(15, 7))
sns.boxplot(data=df, x=column)
Outliers treatment
# Find the first quartile, the third quartile, and the interquartile range
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
# Interquartile range (75th percentile - 25th percentile)
IQR = Q3 - Q1
# Define the lower and upper limits of the normal data range
lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR
# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Age                    0.00
Experience             0.00
Income                 1.92
ZIPCode                0.00
Family                 0.00
CCAvg                  6.48
Education              0.00
Mortgage               5.82
Personal_Loan          9.60
Securities_Account    10.44
CD_Account             6.04
Online                 0.00
CreditCard             0.00
dtype: float64
Personal_Loan, Securities_Account, and CD_Account are yes/no columns, so their "outlier" percentages are not meaningful and will not be treated.
Mortgage and CCAvg have a high percentage of outliers.
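The same whisker logic is applied to Mortgage and then to CCAvg below; as an aside, it could be wrapped in a small reusable helper (a sketch; iqr_outlier_mask is a name introduced here and not used elsewhere in this notebook):
def iqr_outlier_mask(series, k=1.5):
    # Flag values outside the k * IQR whiskers of a numeric Series
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Example: share of Mortgage values flagged as outliers
print(iqr_outlier_mask(df['Mortgage']).mean())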
Removing outliers from Mortgage column
# Identify any data points that fall outside the normal data range
outliersMortgage = df[(df['Mortgage'] < lower_limit['Mortgage']) |
                      (df['Mortgage'] > upper_limit['Mortgage'])]
# Remove the outliers from the dataframe
df = df.drop(outliersMortgage.index, axis=0)
df.shape
(4709, 13)
Checking the outliers percentage
# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Age                    0.000000
Experience             0.000000
Income                 1.571459
ZIPCode                0.000000
Family                 0.000000
CCAvg                  5.776173
Education              0.000000
Mortgage               0.000000
Personal_Loan          8.218305
Securities_Account    10.469314
CD_Account             5.415162
Online                 0.000000
CreditCard             0.000000
dtype: float64
CCAvg still has a high percentage of outliers.
Removing the outliers from CCAvg column
# Identify any data points that fall outside the normal data range
outliersCCAvg = df[(df['CCAvg'] < lower_limit['CCAvg']) |
                   (df['CCAvg'] > upper_limit['CCAvg'])]
# Remove the outliers from the dataframe
df = df.drop(outliersCCAvg.index, axis=0)
df.shape
(4437, 13)
# Percentage of outliers by column
((df < lower_limit) | (df > upper_limit)).sum() / len(df) * 100
Age                    0.000000
Experience             0.000000
Income                 0.878972
ZIPCode                0.000000
Family                 0.000000
CCAvg                  0.000000
Education              0.000000
Mortgage               0.000000
Personal_Loan          6.558485
Securities_Account    10.434979
CD_Account             4.778003
Online                 0.000000
CreditCard             0.000000
dtype: float64
Data preparation for modeling
# Separate independent and dependent variable
x = df.drop(["Personal_Loan"], axis=1)
y = df["Personal_Loan"]
# Splitting data into training and test set:
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.3, random_state=1)
# Printing information about train and test data
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
print("Percentage of classes in training set:",
y_train.value_counts(normalize=True))
print("Percentage of classes in test set:",
y_test.value_counts(normalize=True))
Number of rows in train data = 3105
Number of rows in test data = 1332
Percentage of classes in training set:
Personal_Loan
0    0.937842
1    0.062158
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.926426
1    0.073574
Name: proportion, dtype: float64
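The loan-taker share differs slightly between train (6.2%) and test (7.4%) because the split is purely random; if identical class proportions were required, a stratified split could be used instead (a sketch, not what was used above):
# Alternative split (illustrative): stratify on the target to keep class proportions equal
x_train_s, x_test_s, y_train_s, y_test_s = train_test_split(
    x, y, test_size=0.3, random_state=1, stratify=y
)
print(y_train_s.value_counts(normalize=True))
print(y_test_s.value_counts(normalize=True))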
False positive: predicting a customer will take a loan when they actually will not.
False negative: predicting a customer will not take a loan when they actually will.
Recall should be maximized: since recall = TP / (TP + FN), the higher the recall, the fewer false negatives (missed loan-takers) the model produces.
# Fitting the logistic regression model
logit = sm.Logit(y_train, x_train)
lg = logit.fit(disp=False)
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: Personal_Loan No. Observations: 3105
Model: Logit Df Residuals: 3093
Method: MLE Df Model: 11
Date: Sat, 10 Jun 2023 Pseudo R-squ.: 0.6512
Time: 00:00:09 Log-Likelihood: -252.23
converged: True LL-Null: -723.04
Covariance Type: nonrobust LLR p-value: 6.929e-195
======================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------
Age -0.2454 0.096 -2.556 0.011 -0.434 -0.057
Experience 0.2358 0.096 2.460 0.014 0.048 0.424
Income 0.0600 0.004 13.903 0.000 0.052 0.068
ZIPCode -9.664e-05 2.69e-05 -3.591 0.000 -0.000 -4.39e-05
Family 0.6682 0.118 5.676 0.000 0.437 0.899
CCAvg 0.8211 0.100 8.178 0.000 0.624 1.018
Education 1.7559 0.178 9.841 0.000 1.406 2.106
Mortgage -0.0020 0.002 -1.184 0.236 -0.005 0.001
Securities_Account -1.6790 0.485 -3.459 0.001 -2.630 -0.728
CD_Account 4.8274 0.556 8.678 0.000 3.737 5.918
Online -0.9929 0.255 -3.901 0.000 -1.492 -0.494
CreditCard -1.5420 0.332 -4.644 0.000 -2.193 -0.891
======================================================================================
Possibly complete quasi-separation: A fraction 0.19 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
The model has a pseudo R-squared of 0.6512, a large improvement in log-likelihood over the null model, which is a relatively good fit for a logistic regression.
The coefficients for Age, Experience, Income, ZIPCode, Family, CCAvg, Education, Securities_Account, CD_Account, Online, and CreditCard are all statistically significant; only Mortgage is not (p = 0.236). The significant variables are all associated with the probability of taking out a personal loan.
A negative coefficient indicates that the probability of a person taking out a personal loan decreases as the corresponding attribute increases.
A positive coefficient indicates that the probability of a person taking out a personal loan increases as the corresponding attribute increases.
The p-value of a variable indicates whether it is statistically significant. If we take the significance level to be 0.05 (5%), then any variable with a p-value below 0.05 is considered significant.
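Because the logit coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read (a quick sketch using the fitted model above):
# Odds ratio: multiplicative change in the odds of taking a loan per one-unit
# increase in a predictor, holding the other predictors fixed
odds_ratios = np.exp(lg.params)
print(odds_ratios.sort_values(ascending=False))

# Predictors that are statistically significant at the 5% level
print(lg.pvalues[lg.pvalues < 0.05].index.tolist())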
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
model, predictors, target, threshold=0.5
):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
# checking which probabilities are greater than threshold
pred_temp = model.predict(predictors) > threshold
# rounding off the above values to get classes
pred = np.round(pred_temp)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1, },
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
threshold: threshold for classifying the observation as class 1
"""
y_pred = model.predict(predictors) > threshold
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) +
"\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
confusion_matrix_statsmodels(lg, x_train, y_train)
Train performance
model_performance_classification_statsmodels(lg, x_train, y_train)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.970048 | 0.647668 | 0.833333 | 0.728863 |
Test performance
model_performance_classification_statsmodels(lg, x_test, y_test)
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.95045 | 0.5 | 0.742424 | 0.597561 |
There are different ways of detecting (or testing for) multicollinearity. One such way is the Variance Inflation Factor (VIF).
General rule of thumb: a VIF greater than 5 indicates that a predictor is strongly correlated with the other predictors (some practitioners use 10 as the cut-off), and such predictors are candidates for removal.
vif_series = pd.Series(
[variance_inflation_factor(x_train.values, i)
for i in range(x_train.shape[1])],
index=x_train.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: 

Age                   1321.272418
Experience             328.767236
Income                   5.279677
ZIPCode                378.160349
Family                   5.610381
CCAvg                    3.828702
Education                6.770196
Mortgage                 1.324530
Securities_Account       1.284347
CD_Account               1.349366
Online                   2.611648
CreditCard               1.554974
dtype: float64
x_train1 = x_train.drop('Age', axis=1)
x_test1 = x_test.drop('Age', axis=1)
vif_series = pd.Series(
[variance_inflation_factor(x_train1.values, i)
for i in range(x_train1.shape[1])],
index=x_train1.columns,
dtype=float,
)
print("Series before feature selection: \n\n{}\n".format(vif_series))
Series before feature selection: Experience 4.120057 Income 5.277348 ZIPCode 22.330119 Family 5.603378 CCAvg 3.810217 Education 6.420808 Mortgage 1.324524 Securities_Account 1.282804 CD_Account 1.346788 Online 2.611264 CreditCard 1.554973 dtype: float64
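Several predictors (notably ZIPCode) still have a VIF above 5, but the analysis proceeds after dropping only Age. Purely as an aside, the drop-the-highest-VIF process can be automated (a sketch, not the approach followed below):
# Iteratively drop the predictor with the highest VIF until every VIF is below 5 (illustrative)
x_vif = x_train.copy()
while True:
    vifs = pd.Series(
        [variance_inflation_factor(x_vif.values, i) for i in range(x_vif.shape[1])],
        index=x_vif.columns,
    )
    if vifs.max() < 5:
        break
    x_vif = x_vif.drop(vifs.idxmax(), axis=1)
print(vifs)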
logit1 = sm.Logit(y_train, x_train1)
lg1 = logit1.fit(disp=False)
Training performance
log_reg_model_train_perf = model_performance_classification_statsmodels(
lg1, x_train1, y_train)
log_reg_model_train_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.97037 | 0.65285 | 0.834437 | 0.732558 |
Test performance
log_reg_model_test_perf = model_performance_classification_statsmodels(
lg1, x_test1, y_test)
log_reg_model_test_perf
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.951952 | 0.5 | 0.765625 | 0.604938 |
ROC Curve and ROC-AUC
ROC-AUC on training set
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(x_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(x_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" %
logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
The logistic regression model performs well on the training set.
Optimal threshold using AUC-ROC curve
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(x_train1))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.03589693977783063
Checking model performance on training set
confusion_matrix_statsmodels(
lg1, x_train1, y_train, threshold=optimal_threshold_auc_roc
)
Training performance
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, x_train1, y_train, threshold=optimal_threshold_auc_roc
)
log_reg_model_train_perf_threshold_auc_roc
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.884058 | 0.943005 | 0.34275 | 0.502762 |
ROC-AUC on test set
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(x_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(x_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" %
         logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# creating confusion matrix
confusion_matrix_statsmodels(
lg1, x_test1, y_test, threshold=optimal_threshold_auc_roc)
Test performance
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
lg1, x_test1, y_test, threshold=optimal_threshold_auc_roc
)
log_reg_model_test_perf_threshold_auc_roc
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.869369 | 0.857143 | 0.344262 | 0.491228 |
Precision-Recall Curve
y_scores = lg1.predict(x_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], "b--", label="precision")
plt.plot(thresholds, recalls[:-1], "g--", label="recall")
plt.xlabel("Threshold")
plt.legend(loc="upper left")
plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
At the threshold of 0.35, we get balanced recall and precision.
optimal_threshold_curve = 0.35
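The 0.35 value is read off the plot; the crossing point can also be located programmatically from the curve arrays computed above (a sketch):
# Threshold at which precision and recall are closest to each other
balance_idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
print(tre[balance_idx])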
Checking model performance on training set
confusion_matrix_statsmodels(
lg1, x_train1, y_train, threshold=optimal_threshold_curve)
Training performance
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, x_train1, y_train, threshold=optimal_threshold_curve
)
log_reg_model_train_perf_threshold_curve
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.968116 | 0.746114 | 0.742268 | 0.744186 |
Test performance
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
lg1, x_test1, y_test, threshold=optimal_threshold_curve
)
log_reg_model_test_perf_threshold_curve
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.943694 | 0.561224 | 0.632184 | 0.594595 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold (0.5)",
    "Logistic Regression-0.035 Threshold",
    "Logistic Regression-0.35 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.035 Threshold | Logistic Regression-0.35 Threshold |
|---|---|---|---|
| Accuracy | 0.970370 | 0.884058 | 0.968116 |
| Recall | 0.652850 | 0.943005 | 0.746114 |
| Precision | 0.834437 | 0.342750 | 0.742268 |
| F1 | 0.732558 | 0.502762 | 0.744186 |
# Test performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_test_perf.T,
log_reg_model_test_perf_threshold_auc_roc.T,
log_reg_model_test_perf_threshold_curve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold (0.5)",
"Logistic Regression-0.035 Threshold",
"Logistic Regression-0.35 Threshold",
]
print("Test performance comparison:")
models_train_comp_df
Test performance comparison:
| | Logistic Regression-default Threshold (0.5) | Logistic Regression-0.035 Threshold | Logistic Regression-0.35 Threshold |
|---|---|---|---|
| Accuracy | 0.951952 | 0.869369 | 0.943694 |
| Recall | 0.500000 | 0.857143 | 0.561224 |
| Precision | 0.765625 | 0.344262 | 0.632184 |
| F1 | 0.604938 | 0.491228 | 0.594595 |
The logistic regressions with different thresholds show that the choice of cut-off changes the balance between accuracy, recall, precision, and F1.
With the default threshold of 0.5, the model reaches 95.19% accuracy, 50.00% recall, 76.56% precision, and an F1 score of 60.49%. The model is accurate overall, but it identifies only half of the actual loan-takers: a 0.5 cut-off labels a customer as positive only when the predicted probability is high, so few positives are flagged, giving high precision but low recall.
With a threshold of 0.035, the model reaches 86.94% accuracy, 85.71% recall, 34.43% precision, and an F1 score of 49.12%. Almost all actual loan-takers are caught, but many non-takers are also flagged as positive: the very low cut-off trades precision for recall.
With a threshold of 0.35, the model reaches 94.37% accuracy, 56.12% recall, 63.22% precision, and an F1 score of 59.46%. This cut-off sits between the two extremes and gives the most balanced trade-off between precision and recall.
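To see this trade-off across several cut-offs in one place, the performance helper defined earlier can be looped over candidate thresholds (an illustrative sketch on the test set):
# Sweep a few candidate thresholds and collect the test-set metrics for each
rows = []
for t in [0.05, 0.1, 0.2, 0.35, 0.5]:
    perf = model_performance_classification_statsmodels(lg1, x_test1, y_test, threshold=t)
    perf.index = [t]
    rows.append(perf)
print(pd.concat(rows))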
tree_model = DecisionTreeClassifier(
criterion="gini", random_state=1
)
tree_model.fit(x_train, y_train)
DecisionTreeClassifier(random_state=1)
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1, },
index=[0],
)
return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) +
"\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
confusion_matrix_sklearn(tree_model, x_train, y_train)
Train performance
decision_tree_perf_train = model_performance_classification_sklearn(
tree_model, x_train, y_train)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(tree_model, x_test, y_test)
Test performance
decision_tree_perf_test = model_performance_classification_sklearn(
tree_model, x_test, y_test)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.984234 | 0.857143 | 0.923077 | 0.888889 |
feature_names = list(x.columns)
plt.figure(figsize=(20, 30))
tree.plot_tree(tree_model, feature_names=feature_names,
filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
tree_model.tree_.node_count
101
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
tree_model.feature_importances_, columns=["Imp"], index=x_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.365668
Education           0.313546
Family              0.118694
CCAvg               0.094199
Age                 0.030165
CD_Account          0.024784
ZIPCode             0.024235
Experience          0.009678
Online              0.009392
Mortgage            0.005886
CreditCard          0.002337
Securities_Account  0.001417
importances = tree_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)),
importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# # Choose the type of classifier.
# estimator = DecisionTreeClassifier(random_state=1)
# # Grid of parameters to choose from
# # add from article
# parameters = {'max_depth': np.arange(1, 10),
# 'min_samples_leaf': [1, 2, 5, 7, 10, 15, 20],
# 'max_leaf_nodes': [2, 3, 5, 10],
# 'min_impurity_decrease': [0.001, 0.01, 0.1]
# }
# # Type of scoring used to compare parameter combinations
# acc_scorer = metrics.make_scorer(metrics.recall_score)
# # Run the grid search
# grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
# grid_obj = grid_obj.fit(x_train, y_train)
# # Set the clf to the best combination of parameters
# estimator = grid_obj.best_estimator_
# # Fit the best algorithm to the data.
# estimator.fit(x_train, y_train)
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(6, 15),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
estimator.fit(x_train, y_train)  # Fit the best estimator found by the grid search on the training data
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, min_samples_leaf=7, random_state=1)
confusion_matrix_sklearn(estimator, x_train, y_train)
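It can also help to inspect the search result directly; GridSearchCV exposes the winning parameter combination and its mean cross-validated recall:
# Best hyperparameter combination found by the grid search and its mean CV recall
print(grid_obj.best_params_)
print(grid_obj.best_score_)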
Training performance
decision_tree_estimator_train = model_performance_classification_sklearn(
estimator, x_train, y_train)
decision_tree_estimator_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.988406 | 0.891192 | 0.919786 | 0.905263 |
confusion_matrix_sklearn(estimator, x_test, y_test)
Test performance
decision_tree_estimator_test = model_performance_classification_sklearn(
estimator, x_test, y_test)
decision_tree_estimator_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.981982 | 0.826531 | 0.920455 | 0.870968 |
plt.figure(figsize=(17, 15))
tree.plot_tree(estimator, feature_names=feature_names,
filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(estimator.feature_importances_, columns=[
"Imp"], index=x_train.columns).sort_values(by='Imp', ascending=False))
# After tuning, the importance is concentrated in a few features; the others drop to zero
                         Imp
Income              0.408582
Education           0.365331
Family              0.131687
CCAvg               0.069162
CD_Account          0.025238
Age                 0.000000
Experience          0.000000
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)),
importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000154 | 0.000615 |
| 2 | 0.000279 | 0.001731 |
| 3 | 0.000297 | 0.002326 |
| 4 | 0.000307 | 0.003554 |
| 5 | 0.000322 | 0.004520 |
| 6 | 0.000334 | 0.006189 |
| 7 | 0.000429 | 0.006619 |
| 8 | 0.000483 | 0.007102 |
| 9 | 0.000491 | 0.007593 |
| 10 | 0.000506 | 0.009617 |
| 11 | 0.000515 | 0.010132 |
| 12 | 0.000515 | 0.010648 |
| 13 | 0.000551 | 0.011751 |
| 14 | 0.000580 | 0.012330 |
| 15 | 0.000742 | 0.013815 |
| 16 | 0.000773 | 0.014588 |
| 17 | 0.000854 | 0.015442 |
| 18 | 0.000907 | 0.016348 |
| 19 | 0.001204 | 0.017552 |
| 20 | 0.001400 | 0.021752 |
| 21 | 0.002460 | 0.024212 |
| 22 | 0.002594 | 0.026806 |
| 23 | 0.004148 | 0.030953 |
| 24 | 0.009363 | 0.040317 |
| 25 | 0.011399 | 0.051715 |
| 26 | 0.032437 | 0.116588 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
# Fit a decision tree with the current ccp_alpha on the training data
clf.fit(x_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.032436553118827774
For the remainder of the analysis, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and the tree depth decrease as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
When ccp_alpha is set to zero and the other parameters of DecisionTreeClassifier are kept
at their defaults, the tree overfits, reaching 100% training accuracy against roughly 98%
testing accuracy. As alpha increases, more of the tree is pruned, creating a decision tree
that generalizes better.
train_scores = [clf.score(x_train, y_train) for clf in clfs]
test_scores = [clf.score(x_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ', best_model.score(x_train, y_train))
print('Test accuracy of best model: ', best_model.score(x_test, y_test))
DecisionTreeClassifier(random_state=1)
Training accuracy of best model:  1.0
Test accuracy of best model:  0.9842342342342343
recall_train = []
for clf in clfs:
pred_train3 = clf.predict(x_train)
values_train = metrics.recall_score(y_train, pred_train3)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test3 = clf.predict(x_test)
values_test = metrics.recall_score(y_test, pred_test3)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
confusion_matrix_sklearn(best_model, x_train, y_train)
Train performance
decision_tree_tune_perf_train = model_performance_classification_sklearn(
best_model, x_train, y_train)
decision_tree_tune_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
confusion_matrix_sklearn(best_model, x_test, y_test)
Test performance
decision_tree_tune_perf_test = model_performance_classification_sklearn(
best_model, x_test, y_test)
decision_tree_tune_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.984234 | 0.857143 | 0.923077 | 0.888889 |
plt.figure(figsize=(20, 30))
tree.plot_tree(best_model, feature_names=feature_names,
filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(pd.DataFrame(best_model.feature_importances_, columns=[
"Imp"], index=x_train.columns).sort_values(by='Imp', ascending=False))
                         Imp
Income              0.365668
Education           0.313546
Family              0.118694
CCAvg               0.094199
Age                 0.030165
CD_Account          0.024784
ZIPCode             0.024235
Experience          0.009678
Online              0.009392
Mortgage            0.005886
CreditCard          0.002337
Securities_Account  0.001417
best_model.tree_.node_count
101
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)),
importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
models_train_comp_df = pd.concat(
[decision_tree_perf_train.T, decision_tree_estimator_train.T,
decision_tree_tune_perf_train.T], axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn", "Decision Tree Tuned hyperparameters", "Decision Tree Cost Complexity Pruning"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree sklearn | Decision Tree Tuned hyperparameters | Decision Tree Cost Complexity Pruning |
|---|---|---|---|
| Accuracy | 1.0 | 0.988406 | 1.0 |
| Recall | 1.0 | 0.891192 | 1.0 |
| Precision | 1.0 | 0.919786 | 1.0 |
| F1 | 1.0 | 0.905263 | 1.0 |
models_train_comp_df = pd.concat(
[decision_tree_perf_test.T, decision_tree_estimator_test.T,
decision_tree_tune_perf_test.T], axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn", "Decision Tree Tuned hyperparameters", "Decision Tree Cost Complexity Pruning"]
print("Test performance comparison:")
models_train_comp_df
Test performance comparison:
| | Decision Tree sklearn | Decision Tree Tuned hyperparameters | Decision Tree Cost Complexity Pruning |
|---|---|---|---|
| Accuracy | 0.984234 | 0.981982 | 0.984234 |
| Recall | 0.857143 | 0.826531 | 0.857143 |
| Precision | 0.923077 | 0.920455 | 0.923077 |
| F1 | 0.888889 | 0.870968 | 0.888889 |
What recommendations would you suggest to the bank?
The feature importances show that Income, Education, Family, CCAvg, and CD_Account are the most important features for predicting whether a customer will take out a loan.
The other features (Age, Experience, ZIPCode, Mortgage, Online, CreditCard, and Securities_Account) are much less important, but they can still contribute a little extra accuracy to the model.
A further insight from the feature importances: in the tuned tree, the importance is concentrated entirely in Income, Education, Family, CCAvg, and CD_Account, with the remaining features dropping to zero, so the marketing team can target customers using just a handful of attributes.
The comparison between logistic regression and the decision tree shows that the two approaches have different strengths and weaknesses.
Logistic regression is a linear model that predicts the probability of a customer taking out a loan. Its coefficients are easy to interpret (sign, statistical significance and, via odds ratios, effect size), and it is straightforward to apply to new data, but on this dataset it needed threshold tuning to reach a reasonable recall and still trailed the decision tree.
A decision tree is a non-linear model for the same task. It captures interactions between features such as Income, Education, and Family, and it achieved higher accuracy and recall here. A large unpruned tree is hard to read and tends to overfit, which is why hyperparameter tuning or cost-complexity pruning is needed; the pruned tree remains interpretable as a small set of rules.
Ultimately, the best model for AllLife Bank depends on its needs. If a simple, coefficient-based explanation of what drives loan uptake is the priority, logistic regression is a good option; if predictive performance is the priority, the (pruned) decision tree is the better choice.
As the comparison tables show, the decision tree has higher accuracy, recall, precision, and F1 score than the logistic regression on the test set, which makes it the more suitable model for selecting customers to target in the next campaign, while the logistic regression remains the easier model to interpret.